Natural-language processing system using a large corpus

ABSTRACT

A computer-parsing system based upon using vectors (lists) to represent natural-language elements, providing a robust, distributed way to score grammaticality of an input string by using as a source material a large corpus of natural-language text. The system uses recombining of asymmetric associations of syntactically similar strings to form the vectors. The system uses equivalence lists for the subparts of the string to build equivalence lists for the longer strings in an order controlled by the potential parse to be scored. The power of recombination of vector elements in building longer strings provides a means of representing collocational complexity. Grammaticality scoring is based upon the number and similarity of the vector elements.

BACKGROUND

[0001] The present invention relates to the field of parsing and interpreting natural language text. Deciding which combinations of elements are possible in a natural language (language modeling), deciding what the syntactic relationships are between elements in a given language string (parsing), and deciding the information, or even representing the information expressed by a given natural language string (language understanding), are all fundamental and largely unsolved natural language processing problems. Current approaches can be roughly divided into symbolic, statistical, and distributional:

[0002] 1) Symbolic or rule-based (rules about symbols) methods which seek to find a combination of rules which describe a given string in terms of a set of symbols.

[0003] 2) Statistical methods which optimize the information over larger numbers of simple observations to make predictions about a given string.

[0004] 3) Distributed or memory-based methods which keep the large numbers of simple observations and use them to define or recognize elements of natural language in terms of assemblies.

[0005] Rule-based methods keep a list of all possible relationships between all possible classes of language tokens, then at processing time they look up the tokens in a dictionary and attempt to decide which of their many possible classes, in which of many possible combinations, best describes a given string. E.g.,

[0006] All possible classes (dictionary):

[0007] the<-DET

[0008] Rain<-N

[0009] in<-PREP

[0010] Spain<-N

[0011] . . .

[0012] All possible relations:

[0013] NP<-DET+N

[0014] PP<-PREP+N

[0015] NP<-NP+PP

[0016] . . .

[0017] Analysis: ((The (DET)+rain (N))+(in (PREP)+Spain (N)))

[0018] The above bracketing, denoting a sequence of rule combinations, defines a dependency tree or structure between the elements, as follows, called the “parse tree”:
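
The bracketing of paragraph [0017] can be reproduced mechanically from the dictionary and relations listed in paragraphs [0006] to [0016]. The following is only a minimal sketch in Python: the names DICTIONARY, RULES and parse are hypothetical, and the chart-style search is an assumption made for illustration, since the text does not prescribe a search strategy for rule-based methods.

```python
# Toy rule-based parser for the dictionary and relations listed above.
# DICTIONARY, RULES and parse() are illustrative names, not from the source.
DICTIONARY = {"the": "DET", "rain": "N", "in": "PREP", "spain": "N"}
RULES = {("DET", "N"): "NP", ("PREP", "N"): "PP", ("NP", "PP"): "NP"}


def parse(words):
    """Return (class, bracketing) pairs that cover the whole string."""
    n = len(words)
    # chart[i][j] holds analyses of words[i:j] as (class, bracketing) pairs
    chart = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        cls = DICTIONARY[word.lower()]
        chart[i][i + 1].append((cls, f"{word} ({cls})"))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left_cls, left in chart[i][k]:
                    for right_cls, right in chart[k][j]:
                        parent = RULES.get((left_cls, right_cls))
                        if parent:
                            chart[i][j].append((parent, f"({left}+{right})"))
    return chart[0][n]


analyses = parse("The rain in Spain".split())
print(analyses)
```

With only these three relations the string has a single analysis, an NP whose bracketing matches the one given in paragraph [0017].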

[0019] Statistical methods also keep a list of all possible relationships between all possible classes, but they simplify model building by considering, in general, more, simpler relationships, and use statistics to summarize regularities and optimize predictions. E.g., in+Spain, on+Spain=>posit a statistical variable PREP where (“in”, “on”) are members of PREP and “Spain” follows PREP with probability P(PREP|“Spain”). The analysis is as with rules, but now each combination (branch of the tree) has a probability.
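
As a rough illustration of the statistical approach, the probability attached to a combination can be estimated as a relative frequency over observed pairs. The sketch below is in Python; the observations are invented, the name prob_follows is hypothetical, and a plain relative-frequency estimate is assumed, which is only one of many estimators a statistical system might use.

```python
from collections import Counter

# Toy bigram observations (invented for illustration, not corpus data).
observed_pairs = [
    ("in", "Spain"), ("in", "Spain"), ("on", "Spain"),
    ("in", "Italy"), ("on", "time"), ("the", "rain"),
]
PREP = {"in", "on"}  # members of the posited statistical variable

pair_counts = Counter(observed_pairs)


def prob_follows(word, word_class):
    """Relative frequency with which `word` follows any member of `word_class`."""
    class_total = sum(c for (first, _), c in pair_counts.items()
                      if first in word_class)
    joint = sum(c for (first, second), c in pair_counts.items()
                if first in word_class and second == word)
    return joint / class_total if class_total else 0.0


print(prob_follows("Spain", PREP))  # 3 of the 5 PREP-initial pairs
```

The estimate summarizes the observations in a single number, which is exactly the information a distributed method would instead keep in full.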

[0020] Distributed methods also build classes among more, simpler relationships, but they don't summarize the information. They gain in flexibility and robustness by representing classes directly in terms of collections. E.g., in+Spain, on+Spain=>posit a paradigmatic (paradigmatic: sets of alternatives, as opposed to syntagmatic: sets of consecutive elements) vector class PREP where “in”, “on” are examples of PREP and “Spain” is a component of PREP. (Or inversely, defining the vector in terms of equivalents rather than contexts, posit a vector class PREP where (“in”, “on”) are components of PREP, and PREP is defined as the set of all things which precede “Spain”, . . . ). The analysis is as with rules, but you can have partial matches between vector classes. “Vector” is a well-known expression in natural-language processing; and for present purposes a vector may be briefly described as a list.
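
The idea of a class kept as a collection, with partial matches between classes, can be sketched as follows. This is a Python illustration under stated assumptions: the context sets are invented, the name overlap is hypothetical, and Jaccard overlap is chosen here only as a convenient measure of partial match, not because the text names it.

```python
# Each word is represented by the set of contexts (here: following words)
# it was observed with. The observations are invented for illustration.
contexts = {
    "in": {"Spain", "Italy", "time", "advance"},
    "on": {"Spain", "time", "Monday"},
    "the": {"rain", "law"},
}


def overlap(a, b):
    """Jaccard overlap of two context sets: 1.0 identical, 0.0 disjoint."""
    union = contexts[a] | contexts[b]
    return len(contexts[a] & contexts[b]) / len(union) if union else 0.0


# The "class" of a context is simply the collection of words seen with it.
prep_like = sorted(w for w, ctx in contexts.items() if "Spain" in ctx)

print(prep_like)              # words that precede "Spain"
print(overlap("in", "on"))    # partial match
print(overlap("in", "the"))   # no match
```

Nothing is summarized into a symbol: the class is the list itself, so two words can match to any degree between 0 and 1.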

[0021] What distributed models gain in flexibility and robustness of representation (partial matches between classes), however, they lose by being unwieldy to describe (sets instead of symbols), and for all their advantages in modeling static concepts (see the extensive vector information retrieval literature, e.g. Schuetze, U.S. Pat. No. 6,173,261) there is no consensus on how they might be advantageous for modeling combinations of concepts (syntax or grammar, which is usually still done by keeping a list of all possible relationships between all possible classes).

[0022] The interesting thing about all these prior art models for language processing is that no one has yet been able to compile a truly comprehensive list of all possible relationships between all possible classes of language tokens. And even probabilities only minimize the errors which arise from incomplete information (often called the “sparse data problem”); they don't eliminate them. The models don't quite seem to fit the problem. The present status of natural language processing might justifiably be compared with the status of artificial flight before the discovery of the airfoil.

[0023] Vectors of associative properties have become a popular method of representing the grammar and meaning of words and sometimes word sequences (U.S. Pat. No. 6,173,261 is herein incorporated by reference). But hitherto this has mainly been because of their flexibility and robustness, not generally because of their generative power. I think this power is necessary. In simple terms, the failure in accuracy of the prior art can be expressed as a failure to explain why the expression “strong tea” is preferred over the expression “powerful tea” or why we tend to say “apples and oranges” instead of “oranges and apples”, “bacon and eggs”, “chalk and cheese”, and any one of a number of gradations of syntactic restrictiveness between these and what are recognized as errors in traditional grammar. E.g., from the literature:

[0024] It's/that's easier said than done

[0025] I'm terribly sorry to hear that

[0026] You can't believe a word he says

[0027] I see what you mean

[0028] sad to say

[0029] time of day

[0030] in advance

[0031] (verb) the un(verb)able

[0032] . . .

[0033] The issue for accurate syntax modeling is why we are comfortable, even familiar, with these examples, but less so with, say:

[0034] That is easier spoken than done

[0035] I am terribly happy to hear that

[0036] You can't believe the word he says

[0037] I see the thing you say

[0038] sad to mention

[0039] time of week

[0040] in forward

[0041] Not only is the boundary subtle, but it is fuzzy (there are degrees of distinction). In a classical rule-based or statistical processor we would need a class for every such distinction (and degree of distinction). The power of combinations of examples provides a more practical solution. While we cannot imagine listing classes for each distinction, it is easy to imagine producing a unique combination of examples which distinguishes each, and which provides a fuzzy distinction. “That is easier said than done”, if used often enough, can define its own class and explain itself, all the while providing elements which can form other classes and explain broader regularities in expressions of the type “That is than ______”, or “That ______ easier”. “Strong tea” and “powerful tea” might be distinguished because the distribution of word associations associated with “strong” is different from that associated with “powerful”, in detail, if not generalities.

[0042] While current distributed systems have the power to describe such subtleties of representation, they are limited by the perception that grammatical groupings represent immutable qualities, that classes are to be found not forced.

OBJECTS OF THE INVENTION

[0043] A primary object and feature of the present invention is to fulfill the above-mentioned need by the provision of a system which makes linguistic analysis distinctions based on such ad hoc collections of natural language strings existing in a repository of text or corpus. A further primary object and feature of the present invention is to provide such a system which is efficient and computer-implementable. In addition, it is a primary object and feature of this invention to provide such a system in connection with discerning the relative grammaticality of potential parses and making use of scoring systems therefor. Other objects and features of this invention will become apparent with reference to the following invention descriptions.

SUMMARY OF THE INVENTION

[0044] According to a preferred embodiment of the present invention, this invention provides a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination: for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element; from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus; and for such matched pairs of such matched-pairs third listing, finding, from such string data from such corpus, a fourth listing of each fourth such natural-language element syntactically equivalent to any such matched pair of said third listing. It further provides such a system further comprising scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus. And it provides such a system further comprising, for such fourth natural-language elements of such fourth listing, finding, from such string data from such corpus, a fifth listing of each such natural-language element syntactically equivalent to any such fourth natural-language element.
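
The first through fourth listings of paragraph [0044] can be followed more easily on a small example. The Python sketch below is an illustration only: SIMILAR, CORPUS_COUNTS and combine are hypothetical names, the table entries and counts are invented toy values, and scoring is reduced to the simple occurrence count that the paragraph mentions.

```python
from collections import Counter
from itertools import product

# Toy similarity table: element -> syntactically equivalent elements.
# Contents are assumed for illustration, not taken from a real corpus.
SIMILAR = {
    "the": ["the", "a", "this"],
    "rain": ["rain", "scarecrow", "federal government"],
    "the rain": ["the organization", "the province"],
    "the federal government": ["the law"],
}
# Toy corpus statistics: how often each string occurs (invented counts).
CORPUS_COUNTS = Counter({
    "the rain": 4, "the federal government": 7, "the organization": 5,
    "the province": 2, "the law": 9,
})


def combine(first, second):
    """Build the equivalence list for the adjoining pair `first second`."""
    first_listing = SIMILAR.get(first, [first])
    second_listing = SIMILAR.get(second, [second])
    # Third listing: candidate pairs that are actually found in the corpus.
    third_listing = [f"{a} {b}"
                     for a, b in product(first_listing, second_listing)
                     if CORPUS_COUNTS[f"{a} {b}"] > 0]
    # Fourth listing: everything equivalent to any matched pair, scored by
    # its number of occurrences in the corpus.
    fourth_listing = {}
    for pair in third_listing:
        for equivalent in SIMILAR.get(pair, []):
            fourth_listing[equivalent] = CORPUS_COUNTS[equivalent]
    return third_listing, fourth_listing


third, fourth = combine("the", "rain")
print(third)
print(fourth)
```

In this toy run only two of the nine candidate pairs are observed, and their equivalents together form the new entry for the pair. Repeating the step with that entry as a new first listing gives the fifth and later listings.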

[0045] Moreover, it provides such a system further comprising scoring each such natural-language element of such fifth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fifth listing in such string data from such corpus, and it provides such a system further comprising, for such nth natural-language elements of such nth listing, finding, from such string data from such corpus, an (n+1)th listing of each such natural-language element syntactically equivalent to any such nth natural-language element. It also provides such a system further comprising scoring each such natural-language element of such (n+1)th listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th listing in such string data from such corpus. Also, it provides such a system further comprising: for a second adjoining pair, comprising such first adjoining pair as a second first pair element and another natural-language element adjoining such first adjoining pair as a second second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a second first listing of each such element syntactically equivalent to such second first pair element and a second second listing of each such element syntactically equivalent to such second second pair element; from matching each such second first-listing element with each such second second-listing element, making a matched-pairs second third listing by finding which matched pairs of said matching are found in such string data from such corpus; and for such matched pairs of such matched-pairs second third listing, finding, from such string data from such corpus, a second fourth listing of each second fourth such natural-language element syntactically equivalent to any such matched pair of such second third listing.
And it provides such a system further comprising scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus. Even further, this invention provides such a system further comprising: for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element; from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing. And it provides such a system further comprising scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus.
It also provides such a system according to such steps first set out in this summary, further comprising: repeating such steps while considering such original first adjoining pair as a new first pair element in such repeating, such original fourth listing as a new first listing in such repeating, and a new natural-language element adjoining, in such input string, such new first pair element as a new second pair element, thereby providing a new first adjoining pair, thereby providing a new fourth listing in association with such new first adjoining pair. And it provides such a system further comprising: re-performing the just-above steps while considering such new first adjoining pair as a first replacement first pair element in such re-performing, such new fourth listing as a first replacement first listing in such re-performing, and a further new natural-language element adjoining, in such input string, such first replacement first pair element as a first replacement second pair element, thereby providing a first replacement first adjoining pair, thereby providing a first replacement fourth listing in association with such first replacement first adjoining pair. Also it provides such a system further comprising: further continuing to perform, for such entire input string, such just-above steps while considering such nth first adjoining pair as an (n+1)th replacement first pair element in such further performing, such nth fourth listing as an (n+1)th replacement first listing in such further performing, and a further new natural-language element adjoining, in such input string, such (n+1)th replacement first pair element as an (n+1)th replacement second pair element, thereby providing an (n+1)th replacement first adjoining pair, thereby providing an (n+1)th replacement fourth listing in association with such (n+1)th replacement first adjoining pair.

[0046] Additionally, this invention provides such a system further comprising: for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element; from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing. It also provides such a system further comprising: scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus; wherein said scoring comprises a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus; and wherein such scores for each such natural language element of such (n+1)th fourth listing are essentially added to determine a scoring for a string comprising such (n+1)th replacement first adjoining pair.
And it provides such a system wherein such computer system is applied to possible ordered string subcombinations of at least two potential parses of such natural-language elements of such input string and a highest such scoring among such potential parses is used to determine maximum grammaticality among such potential parses. And it provides such a system wherein said scoring comprises a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus; and, further, wherein such scoring of each such fourth list element comprises: the product of a measure of statistical similarity between each such element (of such first listing) syntactically equivalent to such first pair element and such first pair element; a measure of statistical similarity between each such element (of such second listing) syntactically equivalent to such second pair element and such second pair element; a measure of statistical association between such first and second pair elements; and a measure of statistical similarity between each matched pair of such matched-pairs third listing and each fourth such natural-language element of such fourth listing; and the sum of each such product for each such third list element.
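
The product-and-sum scoring described at the end of paragraph [0046] can be written compactly. The Python sketch below assumes that the four measures have already been computed for each matched pair supporting a candidate element; the function name score_fourth_element and all numeric values are hypothetical, and the text does not fix which particular similarity or association measures are used.

```python
def score_fourth_element(matches):
    """Sum, over matched pairs, of the product of the four measures.

    `matches` is a list of tuples
    (sim_first, sim_second, association, sim_pair_to_candidate),
    one tuple per matched pair of the third listing.
    """
    return sum(s1 * s2 * assoc * s4 for s1, s2, assoc, s4 in matches)


# Two matched pairs supporting one candidate element (numbers are invented).
matches_for_candidate = [
    (1.0, 1.0, 0.5, 0.4),   # exact match on both pair elements
    (0.5, 0.5, 0.8, 0.5),   # looser match on both pair elements
]
print(score_fourth_element(matches_for_candidate))
```

A candidate supported by many matched pairs, each with high similarity, therefore scores higher than one supported by a single exact match.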

[0047] Even moreover, according to a preferred embodiment thereof, this invention provides a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string, to be parsed, of natural-language elements in the subject language, for assisting natural-language parsing, comprising, in combination: for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse; and, further, wherein such scoring comprises essentially adding scores for each such entry to obtain a score for such potential parse.
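
Each potential parse corresponds to one order in which the equivalence lists of subcombinations are combined. As a sketch of what "different orders" can mean, the Python function below enumerates every binary bracketing of a short string. The name bracketings is hypothetical, and exhaustive enumeration is an assumption made for illustration; the text requires only that at least two potential parses be compared.

```python
def bracketings(words):
    """Yield every binary bracketing of `words` as a nested string."""
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield f"({left} {right})"


parses = list(bracketings("the rain in Spain".split()))
for p in parses:
    print(p)
```

A four-word string has five such bracketings, among them "((the rain) (in Spain))". Each would be built up pair by pair and scored from its final equivalence list, and the highest-scoring bracketing taken as the parse.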

[0048] Yet in addition, in accordance with a preferred embodiment thereof, this invention provides a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination: for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element; and from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus; wherein at least one of said first adjoining pair comprises at least a pair of natural-language elements; and, further, wherein at least one of such first pair element and such second pair element comprises at least a pair of words. And it provides such a system wherein each such pair element comprises at least one word; and, further, wherein each such pair element comprises at least two words.

[0049] Also, according to a preferred embodiment thereof, it provides a computer-readable medium (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) whose contents cause a computer system to determine a grammatical parse by: for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse.

[0050] Even further, according to a preferred embodiment thereof, this invention provides a computer-implemented natural-language system (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) comprising: for each of at least two natural-language input subcombinations which are potential subparses of such input string, means for building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; means for building, from such equivalence lists, in different orders for each potential parse of said input string, to a final equivalence list for each such potential parse of such input string; and means for scoring, from the number and quality of entries in each respective such final equivalence list, the grammaticality of such respective potential parse.

BRIEF DESCRIPTION OF THE DRAWINGS

[0051] FIG. 1a illustrates the well-known components of a typical computer system.

[0052] FIG. 1b illustrates an outline of a parser or a language understanding system.

[0053] FIG. 2a illustrates a grammar as allowable sequences among a finite set of word classes.

[0054] FIG. 2b illustrates a grammar as a vector of interchangeable strings (one general paradigm of the instant system).

[0055] FIG. 3 illustrates an example of an implementation of the instant system in the Perl programming language.

[0056] FIG. 4 illustrates a worked example of the “forcing function” (paradigmatic analogy) forcing a reassociation of strings to represent the string “the rain in Spain” (in this case only the best scoring sequence of associations is shown).

[0057] FIG. 5 illustrates a worked example of paradigmatic analogy forcing a non-best case association of strings to represent the string “the rain in Spain” (note that most matches are exact equivalents; there is no generalization to equivalent strings. In general bad sequence matches will be trivial, and thus not result in the largest match sets=scores).

[0058] FIG. 6 illustrates the closest art: estimation of word association probabilities between pairs of elements by paradigmatic analogy.

[0059] Table 1-1 illustrates the nature of data on which the system bases its natural language processing decisions, i.e., raw unannotated text, usually as part of a large corpus.

[0060] Table 1-2 illustrates a list of words and word groups contained in such a text in the contexts they occur with in that text (here the context is limited to one preceding word and one following word).

[0061] Table 1-3 illustrates a list of words and word groups in context from that text, sorted alphabetically according to the context they occur with in that text.

[0062] Table 1-4 illustrates a list of words and word groups in that text indexed by the prior and following word contexts they occur with in that text.

[0063] Table 1-5 illustrates an observed word and word group similarity Table, i.e., a list of words and word groups which occur in common prior and following word contexts in that text, indexed on each other, and scored by a measure of similarity between the words according to the number of common contexts they occur with in the reference text.

[0064] Table 1-6 illustrates a line from the observed word and word group similarity Table with actual similar words and word groups replaced or indexed by post numbers for easier matching.

[0065] Table 2-1 illustrates matches between entries in the word and word group similarity Table for “the” and “rain”.

[0066] Table 2-2 illustrates recombination of Table entries for matched pairs to synthesize a Table entry for “the rain”.

[0067] Table 2-3 illustrates matches between entries in the word and word group similarity Table for “in Spain”.

[0068] Table 2-4 illustrates recombination of Table entries for matched pairs to synthesize a Table entry for “in Spain”.

[0069] Table 2-5 illustrates matches between entries in the word and word group similarity Table for “(the rain)” and “(in Spain)”.

[0070] Table 2-6 illustrates recombination of Table entries for matched pairs to synthesize a Table entry for “(the rain) (in Spain)”.
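
The progression from raw text to a similarity Table described for Tables 1-1 to 1-5 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the text scrap is invented, only single words (not word groups) are indexed, the variable names are hypothetical, and similarity is taken to be the bare count of shared contexts.

```python
from collections import defaultdict

# A scrap of raw, unannotated text standing in for the corpus (invented).
text = "the rain fell in Spain and the wind fell in Italy"
tokens = text.split()

# Tables 1-2 to 1-4 in miniature: index each word by its
# (preceding word, following word) context.
words_by_context = defaultdict(set)
contexts_by_word = defaultdict(set)
for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
    words_by_context[(prev, nxt)].add(word)
    contexts_by_word[word].add((prev, nxt))

# Table 1-5 in miniature: words that share a context, scored by the
# number of contexts they have in common.
similarity = defaultdict(dict)
for members in words_by_context.values():
    for a in members:
        for b in members:
            if a != b:
                shared = contexts_by_word[a] & contexts_by_word[b]
                similarity[a][b] = len(shared)

print(dict(similarity))
```

In this scrap only “rain” and “wind” share a context (“the ___ fell”), so they are the only words listed as similar to one another.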

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT AND THE BEST MODE OF PRACTICE

[0071] A simple rule based system relying on a listing of possible sequences of classes for all possible classes of all possible linguistic tokens is shown schematically in FIG. 2a. In this model “The rain in Spain” is considered an acceptable string of natural language for the purposes of applications such as those described in FIG. 1b if there exists a dictionary and a set of rules such that a class can be found in the dictionary for each word in a posited string, in an order considered acceptable by the set of rules in the processor. FIG. 2b shows the style of grammatical representation used in this method, stylized here as a “vector” representation of grammar. A posited string is grammatical because there are many similar established or observed strings, the elements or the entirety of which occur in common paradigmatic contexts (same preceding and following words). This invention provides an improvement to current vector or distributed systems based on the supposition that the essential mechanism of language processing is the recombination of paradigmatic sets of language elements (letters, words, etc.) making up a distributed representation of grammar to create new distributed or vector classes for each novel string or sequence of words with which the system is presented.

[0072] By recombination I mean that the elements of vectors representing observed natural language strings are recombined to estimate vectors representing unobserved natural language strings. This recombination of vector elements to create new vectors associated with novel combinations gives this system a new power to describe idiosyncrasies of linguistic association over and above the prior art of symbolic, statistical, or other vector distribution based methods. By power I mean that paradigmatic vectors of word combinations can be combined to give other vectors which also describe (slightly different) word combinations. E.g., if we add “over” to the vector (“in”, “on”) we get a new vector (“over”, “in”, “on”) which combines fine with a word like “Spain” but alters the fit of the class as a whole with a word like “due” (“over due”), and if we remove “on” from the same class we get another (“over”, “in”), which works slightly differently from “on” (we can group “on” with “about”) and so on. Each of these vectors will be slightly different in terms of the syntactic property it models. The system described here differs from the prior vector distribution of word association art in that it harnesses the power of elements of vectors to be recombined to create new vectors. It is not only a vector model but a recombinatorial vector model. It is also a preferred feature that these recombinations of paradigmatic vectors (sets of equivalents) are “forced” by sequences of syntagmatic matches between elements of paradigmatic vectors representing the “class” of each word or word sequence, as calculated from lists of examples beforehand, or produced from the last recombination of processing (E.g. FIG. 4 “The rain in Spain”)*. The size and “goodness of fit” of these “forced” classes are the essence of the principle of grammaticality and syntax (the thing which controls possible combinations, and the order or hierarchy of combination) which is an underlying feature herein.
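
The “over”, “in”, “on” example can be made concrete with a toy measure of fit. In the Python sketch below the observed pairs are invented, the name fit is hypothetical, and “fit” is assumed to be simply the fraction of a vector's members observed before a given word; the text does not commit to this particular measure.

```python
# Observed (preceding word, following word) pairs; invented for illustration.
OBSERVED = {("in", "Spain"), ("on", "Spain"), ("over", "Spain"),
            ("over", "due"), ("in", "due")}


def fit(vector, following):
    """Fraction of the vector's members observed before `following`."""
    hits = sum((member, following) in OBSERVED for member in vector)
    return hits / len(vector)


base = ("in", "on")
extended = ("over", "in", "on")   # add "over"
reduced = ("over", "in")          # then remove "on"

for vec in (base, extended, reduced):
    print(vec, fit(vec, "Spain"), fit(vec, "due"))
```

All three vectors combine equally well with “Spain”, but their fit with “due” differs, which is the sense in which each recombined vector models a slightly different syntactic property.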

[0073] The described system is mainly implemented as a parser, but with the correspondence to the conceptual classes of vector information retrieval it might be seen as a language understanding system, where the representation of the meaning of a presented sentence is the produced vector of similar words. In this, it is quite similar to the vectors of similar words or word associations which are used in many current vector information retrieval applications (e.g. Schuetze, U.S. Pat. Nos. 5,864,846 and 5,675,819).

[0074] Summarizing: processing seeks to find, in a sense, a “resonance”, or new grammatical regularity (“class”), between the presented string and sets of observed associations embodied in a word and word group similarity Table. Central is the creation of new perspectives from existing information: recombination. The underlying concept may also be thought of as that of a “perspective generator”. Dagan et al.'s publications on the analogous estimation of probabilities for unobserved pairs between words are the closest work known to this applicant. While there they match sets of words (where by “match” I mean find observed pairs between a set of words with similar associative properties to one word, and a set of words with similar associative properties to another) to generate analogous pairs in some similar ways as the matches which are used to force sets of recombinations here, they do not recombine vectors associated with sets of these analogous pairs to define new, vector, grammatical elements, and they do not use these vectors, iteratively, to generate new vectors defining a phrase structure and representing a meaning for entire strings.

[0075] The detailed elements of processing are as follows. With reference to FIG. 1, the system takes data of example strings from a corpus of text and creates a Table of word and word group similarity between words in that corpus. This Table is used by the system to interpret natural language text for some Natural Language Processing application such as a speech recognition engine, a parser, a grammar checker, a search engine, and so forth. The application sends the string to be analyzed to the system and the system returns an analysis or interpretation of that Natural Language string with reference to the data in the corpus. For instance the inputs and the outputs of the system can be:

[0076] Inputs:

[0077] 1. The string or sentence to be “recognized” (parsed, analyzed or interpreted).

[0078] 2. A Table of word (or other linguistic unit) associations and similarities. These Tables are much like those calculated by Dekang Lin (An Information-Theoretic Definition of Similarity. Proceedings of International Conference on Machine Learning, Madison, Wis., July, 1998), and others, with the difference that (at least) word pairs are also related to other pairs and/or single words (this is necessary for parsing and the representation of arbitrarily long strings with shorter strings). In general these Tables can be derived directly from distributions of word associations in flat, unannotated text.

[0079] Outputs:

[0080] 1. A “class”, a new entry for the word similarity Table, derived from a collection of similar words filtered by a sequence of pairings. This new entry represents the “grammar” of the presented string.

[0081] The order of paradigmatic matches, where at least some match elements contain one or more language elements, which produces the biggest or best such class gives a “parse tree” for the string. This is shown in the embodiment described here by the order of bracketing in the head string of the new entry in the word similarity Table. FIG. 3A gives pseudocode for an algorithm to implement the method. The steps of the code are shown conceptually in FIG. 4. They are as follows.

[0082] Given the Input:

[0083] String to be analyzed: “The rain in Spain”

[0084] A Word and Word String Similarity Table with at Least Entries:

[0085] THE|the, a, this, its, their, his, an, your, our, these . . .

[0086] RAIN|rain, scarecrow, wind, federal government . . .

[0087] IN|in, throughout, during, anywhere in, of, as in . . .

[0088] SPAIN|Spain, this regard, North America, this case, this area, recent years, the evenings, any case, Italy, Europe . . .

[0089] THE RAIN|the organization, the province . . .

[0090] THE SCARECROW| . . .

[0091] THE FEDERAL GOVERNMENT|the law . . .

[0092] IN SPAIN| . . .

[0093] IN THIS REGARD|in this regard, in this case . . .

[0094] THROUGHOUT SPAIN|throughout Spain, in Spain . . .

[0095] IN NORTH AMERICA|. . .

[0096] THE PROVINCE IN SPAIN|hotels throughout Spain, crime in Kenya . . .

[0097] THE LAW IN THIS CASE|Tanzania in particular . . .

[0098] The system will give an output in the form of a new entry in the word and word string similarity Table: ((the rain)(in Spain))|hotels throughout Spain, crime in Kenya, Tanzania in particular . . . (where the bracketing of the head string for the new entry gives the “parse” structure of the analysis).

[0099] The steps to produce this new Table entry are (as shown in FIG. 4):

[0100] 1. Match elements of entry for “the” and entry for “rain”, i.e., we sift between all possible combinations of the words and word strings similar to “the” and “rain”, as expressed in the Table made from our original data text (corpus), to find analogous pairs which are observed (they are actually found in the text on which the word and word similarity Table is based, and there was enough text to generate a Table entry for them). For the purposes of this example let us say we find matches: “the rain”, “the scarecrow”, “the Federal Government”, . . .

[0101] 1a. Create a new Table entry for the pair “(the rain)” by recombining elements from the Table entries of these matches, i.e., entries from: THE RAIN+THE SCARECROW+THE FEDERAL GOVERNMENT . . .

[0102] Let's say we find these to be: the organization, the province, the law . . . We now have a new Table entry: (the rain)|the organization, the province, the law . . . Note that if our analysis string was just “the rain” then processing would now be over and the size and strength (see scoring below) of the matched class would be a measure of the grammaticality of the presented combination “the rain”. This may be compared with Dagan's work, where they average the probability of matched analogous pairs to estimate a probability for an unobserved pair. The grammatical definition given here, i.e. that grammaticality is based on the strength/similarity/quality of elements in a vector of similar elements to that string, might be seen as a basis for the accuracy of their probability estimates (accurate to the extent that probability is dependent on grammar and meaning alone).

[0103] 2. Match elements of entry for “in” and entry for “Spain”.

[0104] For the purposes of this example say we get matches: “in Spain”, “in this regard”, “throughout Spain”, “in North America”, . . .

[0105] 2a. Create a new Table entry for the pair “(in Spain)” by recombining elements from the Table entries of these matches, i.e., entries from: IN SPAIN+IN THIS REGARD+THROUGHOUT SPAIN+IN NORTH AMERICA . . . Let's say we find those to be: in Spain, throughout Spain, in this case, in this regard . . . We now have another new Table entry: (in Spain)|in Spain, throughout Spain, in this case, in this regard . . .

[0106] 3. Match elements of (newly created) entry for “(the rain)”, and (newly created) entry for “(in Spain)”. For this example I have listed these to be: the province in Spain, the law in this case . . .

[0107] 3a. Create a new Table entry for the pair “((the rain) (in Spain))” by recombining elements from the Table entries of these matches, here entries from: THE PROVINCE IN SPAIN+THE LAW IN THIS CASE . . . In the example figure (FIG. 4), I have these as: hotels throughout Spain, crime in Kenya, Tanzania in particular, . . . We now have a new Table entry: ((the rain) (in Spain))|hotels throughout Spain, crime in Kenya, Tanzania in particular . . . This new Table entry can be considered to be a grammatical representation of the string “the rain in Spain”, and the bracketing structure of its head string gives the parse structure it is associated with (the order of pairing needed to produce it):

[0108] Which can be compared to a rule-based parse of the kind:

[0109] The difference is that the “classes” (NP, PP, DET, N, PREP) are now fuzzy, robust, flexible sets or vectors, and the validity of the parse as a whole is based on a fuzzy, robust, flexible measure, the size and strength of the synthesized class for “((the rain) (in Spain))”. In contrast to symbolic classes this can represent any degree of restriction, from strict collocational habit (e.g. “easier said than done”) to broad grammatical generalizations (e.g. words like “easier”+words like “said”, cf. ADJ+V). This differs from previous distributed methods in that we can now have an enormous number of such classes, one for each possible string in the language. Classes do not exist to be identified but are perspectives to be found, some “resonating” more strongly than others. The size or strength, the “resonance”, of that class is recognized to be the basis of a concept of grammaticality. The principle of syntax is that these perspectives are found by forcing new recombinations of paradigmatic vectors according to syntagmatic combinations.
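The three match-and-recombine steps of paragraphs [0100] to [0107] can be sketched as follows. This is only an illustrative Python sketch, not the code of FIG. 3: it ignores scores, uses only the Table elements actually listed in the example (elided elements are left out), and the names TABLE, match, recombine and synthesize are assumptions introduced here.

```python
from itertools import product

# Toy Table holding only the elements listed in the example (no scores).
TABLE = {
    "the": ["the", "a", "this", "its", "their", "his", "an", "your", "our", "these"],
    "rain": ["rain", "scarecrow", "wind", "federal government"],
    "in": ["in", "throughout", "during", "anywhere in", "of", "as in"],
    "Spain": ["Spain", "this regard", "North America", "this case", "this area",
              "recent years", "the evenings", "any case", "Italy", "Europe"],
    "the rain": ["the organization", "the province"],
    "the scarecrow": [],
    "the federal government": ["the law"],
    "in Spain": [],
    "in this regard": ["in this regard", "in this case"],
    "throughout Spain": ["throughout Spain", "in Spain"],
    "in North America": [],
    "the province in Spain": ["hotels throughout Spain", "crime in Kenya"],
    "the law in this case": ["Tanzania in particular"],
}

def match(left, right, table):
    """Pairs of (element similar to left, element similar to right) observed as Table entries."""
    return [f"{a} {b}" for a, b in product(left, right) if f"{a} {b}" in table]

def recombine(matches, table):
    """Order-preserving union of the Table entries of the matched pairs."""
    merged = []
    for pair in matches:
        for element in table[pair]:
            if element not in merged:
                merged.append(element)
    return merged

def synthesize(left, right, table):
    """New Table entry for the pair (left right)."""
    return recombine(match(left, right, table), table)

the_rain = synthesize(TABLE["the"], TABLE["rain"], TABLE)    # steps 1, 1a
in_spain = synthesize(TABLE["in"], TABLE["Spain"], TABLE)    # steps 2, 2a
whole = synthesize(the_rain, in_spain, TABLE)                # steps 3, 3a
print("((the rain) (in Spain))|" + ", ".join(whole))
```

Because each synthesized entry is itself just a list of similar strings, it can be fed back into the same function, which is what allows the pairing to proceed up to the whole string.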

[0110] As to scoring, of course the method of forcing recombinations will give more new entries than the “correct” or “best path” one given in the above example. In fact it will produce a new word and word string similarity Table entry for every sequence of combination of initial elements into one final pair, i.e.,

[0111] (the (rain (in Spain)))|???

[0112] (the ((rain in) Spain))|???

[0113] ((the rain) (in Spain))|???

[0114] ((the (rain in)) Spain)|???

[0115] (((the rain) in) Spain)|???
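The five orders of combination listed above are the binary bracketings of a four-word string. A minimal Python sketch that enumerates them is given here for illustration; the function name bracketings is an assumption and does not appear in the figures.

```python
def bracketings(words):
    """Yield every order of pairwise combination of a word sequence as a bracketed string."""
    if len(words) == 1:
        yield words[0]
        return
    # Choose where the final pair is split, then bracket each side recursively.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield f"({left} {right})"

orders = list(bracketings(["the", "rain", "in", "Spain"]))
for order in orders:
    print(order + "|???")
```

The number of such orders grows quickly with string length, which is one reason the scoring described below is needed to select among them.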

[0116] A sequence of matches leading to a non-best case association of strings is shown in FIG. 5. Note that most matches are exact equivalents; there is no generalization to equivalent strings. It is the greater power of generalization of “strong” structural elements (intuitively reasonable) which is important in determining structure. They will tend to generalize better to strings of different lengths (e.g. “anywhere in” will also be equal to “in”, as well as “somewhere in”). In general bad sequence matches will be trivial, and thus will not result in the largest match sets, i.e. scores.

[0117] The Table entry which provides the best analysis, or parse, for the input string is decided by a scoring mechanism on the synthesized Table entry. Currently the scoring function I use is to take the simple sum of the estimated similarity values for each member of the new Table entry. This can be thought of as measuring the “goodness of fit” between the order of combinations to produce the new entry and the data in the rest of the word and word string similarity Table representing the syntax of the whole language.

[0118] Essentially the concept of parse is based on the comparison (foreach triplet of language elements):

[0119] ((go with) her)->((??) her) . . . etc., substituting the first pair (go with) with “syntactically similar” equivalents, or

[0120] (go (with her))->(go (??)) . . . etc., substituting the last pair (with her) with “syntactically similar” equivalents.

[0121] Whether we consider the first order of pairings “more grammatical” than the second order of pairings depends on how many associations we can find in the data with equivalents of the first pair compared with equivalents of the second pair (or how “dense” the associated group of equivalents is, where density is defined according to the convolution of statistical association measures mentioned above). Note that in the current implementation similarity scores for the recombined entries are calculated from the weighted mutual information between matched words (similarity of first word×similarity of second word×mutual information between them). It may under some circumstances be preferable to use a simple ratio of observed to possible matches to represent the goodness of fit between the match words, and provide a value for the similarity score. Note also that the observed similarity scores calculated in the original word similarity Table are based on immediate word association contexts. It may be useful to use broader context or other similarity measures. The interaction of the system with the corpus data is shown step-by-step in detail in Tables 1-1 to 1-6. This processing can all occur “off line” to speed processing on a serial computer.

[0122] Table 1-1 shows a short segment of raw text from a corpus to give an idea of the form, if not the size, of such a corpus.

TABLE 1-1
... lets see what would get you there then leaving probably the seventh from San Jose or San ...

[0123] This text has been normalized by removing all capitalization, punctuation, and placing separate sentences on separate lines. To calculate an initial word and word group similarity Table to provide sets of “equivalent” words and word groups as used in the example in FIGS. 4 and 5, the system first segments all the text in the format of Table 1-1 to create a list of all words and word groups (up to a given length, in this embodiment, length 3) in a context of the word which precedes the word group and the word which follows the word group. A complete list of such segments for the short extract from a corpus given in Table 1-1 is shown in Table 1-2.

TABLE 1-2 List of Word Groups in Context:
lets see&what&would get
lets see&what would
lets see what
see what&would&get you
see what&would get
see what would
what would&get&you there
what would&get you
what would get
would get&you&there then
would get&you there
would get you
get you&there&then leaving
get you&there then
get you there
you there&then&leaving probably
you there&then leaving
you there then
there then&leaving&probably the
there then&leaving probably
there then leaving
then leaving&probably&the seventh
then leaving&probably the
then leaving probably
leaving probably&the&seventh from
leaving probably&the seventh
leaving probably the
probably the&seventh&from San
probably the&seventh from
probably the seventh
the seventh&from&San Jose
the seventh&from San
the seventh from
seventh from&San&Jose or
seventh from&San Jose
seventh from San
from San&Jose&or San
from San&Jose or
from San Jose
...

[0124] The system has taken the first 5 words of the extract and listed a word group consisting of the second, the third and the fourth, joined together to make one group in the context of the first and the fifth words, and a word group consisting of the second and third, joined together to make one group in the context of the first and the fourth, and a word group consisting only of the second word in the context of the first and third. It then lists a similar set of all possible words and word groups in context for the second 5 words of the extract, starting at the second word and extending to the sixth, so the words and word groups in context consist of the third, fourth and fifth words in the extract, joined together to make one group in the context of the second and sixth word, and the third and fourth word in the context of the second and fifth, the third in the context of the second and fourth, etc. The complete group of all such possible divisions for the short extract from a corpus shown in Table 1-1 is given in Table 1-2. In practice the system generates all possible such divisions for the whole corpus of text on which the natural language processing decisions are to be based. Table 1-3 shows an excerpt from such a list calculated for an entire corpus of extracts such as that shown in Table 1-1, sorted alphabetically on the context of first and last words and contracted so that each unique word and word group in a given context is shown only once, with the number of times it occurred in the entire corpus written as a number preceding it.

TABLE 1-3 Sorted Word Group In Context List:
6 up an&hour&and a
8 up for a
10 up having a
8 up in a
7 up on&delta&as a
8 up sometimes&international&takes a
7 up to&you a
7 up with a
8 up after&tomorrow&how about
8 up to ah
9 up between airlines
8 up give&me&an airport
9 up here&nine&forty am
8 up give&me an
6 up an&hour and
9 up their and
6 up time&um&ok and
9 up to&new&york and
8 up to&one&eighteen and
11 up your&profile and
9 up and&I apologize
8 up or are
7 up on&delta as
6 up right at
8 up right away
9 up and&that&would be
6 up everythingll be
15 up to be
9 up you&know before
7 up a bit
9 up the&they call
8 up if&I&had called
8 up with&whatever&we can
7 up on delta
9 up the&itinerary&oh do
8 up what&rate do
7 up for&reticket&I dont
7 up they dont
8 up to&one eighteen
8 up to&uh&to fairbanks
6 up everythingll&be&just fine
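The segmentation described in paragraphs [0123] and [0124] can be sketched in Python as below. This is an illustrative sketch, not the implementation of the figures; the function name word_groups_in_context and the use of collections.Counter for the contracted counts are assumptions made here.

```python
from collections import Counter

def word_groups_in_context(words, max_len=3):
    """List (preceding word, word group, following word) triples, longest group first."""
    segments = []
    for start in range(len(words)):
        for length in range(max_len, 0, -1):
            end = start + length + 1          # index of the following context word
            if end < len(words):
                group = "&".join(words[start + 1:end])
                segments.append((words[start], group, words[end]))
    return segments

text = ("lets see what would get you there then leaving probably "
        "the seventh from San Jose or San")
segments = word_groups_in_context(text.split())

# Contract duplicates into counts and sort on the context (first and last words).
counts = Counter(segments)
for (left, group, right), n in sorted(counts.items(), key=lambda kv: (kv[0][0], kv[0][2])):
    print(n, left, group, right)
```

Run on the extract of Table 1-1, the first 39 triples correspond to the rows shown in Table 1-2; the remaining three come from the last words of the extract, which the Table elides.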

[0125] For instance, the word group “an hour and” is shown in Table 1-3 as having occurred 6 times in the context of the preceding word “up” and following word “a” in the entire corpus of text on which processing in this case is to be based. Similarly the word “for” has been found to occur 8 times in this context, “after tomorrow how” to occur 8 times in the context of “up” and “about”, and “between” to occur 9 times between “up” and “airlines”. Table 1-4 shows the same information in a different format with all the words and word groups which occur in a given context, indexed on that context.

TABLE 1-4 Common Context Table:
up_a| an&hour&and#6 for#8 having#10 in#8 on&delta&as#7 sometimes&international&takes#8 to&you#7 with#7
up_about| after&tomorrow&how#8
up_ah| to#8
up_airlines| between#9
up_airport| give&me&an#8
up_am| here&nine&forty#9
up_an| give&me#8
up_and| an&hour#6 their#9 time&um&ok#6 to&new&york#9 to&one&eighteen#8 your&profile#11
up_apologize| and&I#9
up_are| or#8
up_as| on&delta#7
up_at| right#6
up_away| right#8
up_be| and&that&would#9 everythingll#6 to#15
up_before| you&know#9
up_bit| a#7
up_call| the&they#9
up_called| if&I&had#8
up_can| with&whatever&we#8
up_delta| on#7
up_do| the&itinerary&oh#9 what&rate#8
up_dont| for&reticket&I#7 they#7
up_eighteen| to&one#8
up_fairbanks| to&uh&to#8
up_fine| everythingll&be&just#6 right&away&thats#8
up_first| his&profile#7 your&profile#7
up_flight| the#18 with&a&united#7
up_for| a&reservation#8 an&order#9 and&I&apologize#9 the&profile#18
up_forty| here&nine#9 to&ah&one#8
up_four| to&nine&thirty#7
up_front| at&the#6
up_gonna| whether&thats#10
up_got| between&airlines&ive#9 here&ok&ive#8
up_great| yeah#9
up_had| if&I#8 ok&now&I#9
up_high| pretty#8
up_him| the&profile&for#9
up_hour| an#6
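Re-indexing the counted list of Table 1-3 on context, to obtain the layout of Table 1-4, might look like the following Python sketch. The rows are copied from Table 1-3; the function names common_context_table and format_entry are assumptions introduced for illustration.

```python
from collections import defaultdict

# A few rows copied from Table 1-3: (count, preceding word, word group, following word).
rows = [
    (6, "up", "an&hour&and", "a"),
    (8, "up", "for", "a"),
    (10, "up", "having", "a"),
    (8, "up", "to", "ah"),
    (15, "up", "to", "be"),
]

def common_context_table(rows):
    """Index every word group on its 'preceding_following' context."""
    table = defaultdict(dict)
    for count, left, group, right in rows:
        table[f"{left}_{right}"][group] = count
    return table

def format_entry(context, groups):
    """Render one context in the 'context| group#count ...' layout of Table 1-4."""
    return f"{context}| " + " ".join(f"{g}#{n}" for g, n in groups.items())

table = common_context_table(rows)
for context, groups in table.items():
    print(format_entry(context, groups))
```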

[0126] So we can see that the words and word groups “an hour and”, “for”, “having”, “in”, “on delta as”, “sometimes international takes”, “to you”, and “with” all occurred in the context of “up” preceding and “a” following in this corpus of text with the “frequency” or number of times of 6, 8, 10, 8, 7, 8, 7, and 7 respectively. Information of this kind (common occurrence of words and word groups in context) is what the natural language processing decisions of the system are based on. In Table 1-5 each word and word group extracted from the corpus in Table 1-2 is indexed on each other word or word group with a similarity score based on the number of common contexts these words and word groups have been found to have in Table 1-4.

TABLE 1-5
an&hour&and#15| an&hour&and#1.00
for#2245| was#0.03 ah#0.02 at#0.01 if#0.03 in#0.03 is#0.02 ok#0.02 on#0.07 so#0.02 uh#0.02 going&to&be#0.03 like#0.03 what#0.01 with#0.03 and#0.01 but#0.03 for#1.00 get#0.02 thats#0.02 its#0.03 this&is#0.02
having#80| having#1.00
in#2530| was#0.02 at#0.02 in#1.00 is#0.04 on#0.05 to#0.01 uh#0.01 whats#0.04 returning&on#0.03 what#0.02 with#0.03 back#0.02 and#0.02 but#0.02 would&be#0.02 for#0.03 thats#0.02 its#0.01
on&delta&as#7| on&delta&as#1.00
sometimes&international&takes#8| sometimes&international&takes#1.00
to&you#124| to&you#1.00
with#878| in#0.03 on#0.02 with#1.00 for#0.03

[0127] The similarity score may be calculated by a number of standard statistical measures, of which one is the “mutual information”. Mutual information is commonly calculated between all kinds of events. Here the standard measure is applied and events are taken to be occurrences of a word or word group in a given context: if two words or word groups have a high mutual information, and one occurs in a given context, then you expect the other to occur there too, i.e. they are similar. Mutual information between two words or word groups here is calculated as “the square of the number of common occurrences of two words and word groups in the whole corpus divided by the number of occurrences of each separately”. Thus Table 1-5 gives an extract from a Table of words and word groups indexed on similar words and word groups according to the mutual information calculated from their common contexts shown in Table 1-4. Another such similarity score is given by the ratio of Shannon information: 2×Information(common features of word1 and word2)/(Information(features word1 only)+Information(features word2 only)). (See Dekang Lin, An Information-Theoretic Definition of Similarity. Proceedings of International Conference on Machine Learning, Madison, Wis., July, 1998.)
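One possible reading of the mutual information formula quoted above is sketched below in Python. The text does not define “common occurrences” precisely; this sketch assumes, purely for illustration, that it is the sum over shared contexts of the smaller of the two counts. The function name similarity and that counting rule are assumptions, not taken from the figures. The counts used are the ones listed for “to” and “everythingll” in the excerpt of Table 1-4, so the resulting value reflects only that excerpt.

```python
def similarity(contexts_a, contexts_b):
    """Square of the common occurrences divided by the occurrences of each separately.

    contexts_a / contexts_b map a context (e.g. 'up_be') to an occurrence count.
    ASSUMPTION: common occurrences = sum over shared contexts of the smaller count.
    """
    common = sum(min(n, contexts_b[c]) for c, n in contexts_a.items() if c in contexts_b)
    total_a = sum(contexts_a.values())
    total_b = sum(contexts_b.values())
    return common ** 2 / (total_a * total_b)

# Counts taken from the excerpt of Table 1-4 only (not the whole corpus).
to_contexts = {"up_ah": 8, "up_be": 15}
everythingll_contexts = {"up_be": 6}

print(round(similarity(to_contexts, everythingll_contexts), 2))
print(similarity(to_contexts, to_contexts))
```

Under this reading a word or word group compared with itself scores 1.00, which agrees with the self-similarity values shown in Table 1-5.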

[0128] The interaction of the system with the input string exchanged with the natural language processing application in FIG. 1b is shown step-by-step in Tables 2-1 to 2-6. The sequence of steps parallels that shown diagrammatically in FIG. 4. The essence of processing is to create a new “ad hoc” class, or “resonance”, for each presented string. The program starts with a Table (words and word pairs indexed to similar words) and seeks to use the similarities between words in the Table itself to synthesize new entries (from new combinations of existing entries). E.g., if we find a pair “I go” in the input, but don't have any entry for it in the index, we look at all the words similar to “I” and all the words similar to “go” and see if we can use entries for pairs between those (maybe “he goes” . . . ) to substitute for a genuine “I go” entry. That is all, though it does it for many pairs.

[0129] For use in the code shown in this implementation the Table of word and word group similarity like Table 1-5 now has actual words and word groups substituted with sequences of position numbers for words and word groups relative to each other, and looks a little like this (German words “aas”, “abhang”, and pair “aber&und” in the basic format: “word| similar_word_position_1, score_value similar_word_position_2, score_value . . . ”):

TABLE 1-6
...
aas| 1290,0.02 6761,0.02 13734,0.02 22306,0.02 22310,0.02 22324,0.02 22332,0.02 35767,0.02
abhang| 9638,0.12
aber&und| 610,1.00 715,1.00
...

[0130] Note that for “abhang”, “9638” is a “word_position” for an equivalent to “abhang”, and “0.12” is the similarity score of that equivalent. The “word_position” is the position assigned to all the words and word groups which can result in matches which are in the word and word group similarity Table. Using this instead of the words themselves helps us search for new pairs in the index more efficiently. E.g., if “the week” is the first entry in the word and word group similarity Table we might assign position numbers 1 to “the” and 2 to “week”. If “the decision” is the 232nd entry we might add a position number 232 to all occurrences in the word and word group similarity Table of “the” and 233 to all occurrences of “decision”. The word and word group similarity Table is referred to in the code as the “Index Table” because of these indices.

[0131] The search for matches is shown in FIGS. 3B-4 under the comment “this is where the actual match takes place”. Using relative position numbers of elements of any possible match to a pair which occurs as a word and word group similarity Table entry means the actual match algorithm can be as efficient as simply reading the position number indices of words in position one of the prospective match into an associative array and then looking up positions of words in position two in this array. Any valid lookups are a match.
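The associative-array lookup described here can be sketched in Python as follows. The data layout is an assumption made for illustration: position lists are kept separately for words appearing first and second in a pair entry, and a second word is taken to carry the first word's position plus one, following the “the week” (1, 2) and “the decision” (232, 233) numbering in the text. The position numbers 500 and 501 and the names FIRST_SLOT, SECOND_SLOT and find_matches are hypothetical.

```python
# Hypothetical position lists. For the pair entry numbered p, the first word
# carries p and the second word carries p + 1 (numbering taken from the text).
FIRST_SLOT = {"the": [1, 232], "other": [500]}
SECOND_SLOT = {"week": [2], "decision": [233], "day": [501], "rain": []}

def find_matches(first_equivalents, second_equivalents):
    """Return (first, second) pairs that occur as entries in the similarity Table."""
    # Read positions of words in position one into an associative array,
    # keyed by the position a matching second word would have to carry.
    expected = {}
    for word in first_equivalents:
        for p in FIRST_SLOT.get(word, []):
            expected[p + 1] = word
    # Look up positions of words in position two; any valid lookup is a match.
    matches = []
    for word in second_equivalents:
        for p in SECOND_SLOT.get(word, []):
            if p in expected:
                matches.append((expected[p], word))
    return matches

found = find_matches(["the", "other"], ["rain", "week", "decision", "day"])
print(found)
```

The point of the design is that the cost is one pass over each equivalence list plus constant-time lookups, rather than a string comparison for every candidate pair.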

[0132] In the implementation shown in FIG. 3 there are three key functions: PAIR, MATCH, and ADD_INDICIES. PAIR is just a skeleton to print out all pairs in a string of words. MATCH in this version uses the above mentioned “word position mask” (between all the equivalent words of the first word, and all the equivalent words of the second word) to find pairs which occur in the word and word group similarity Table. ADD_INDICIES makes the new entry from matches (adding word position indices). There is also a function REDUCE in this example code, but its function is largely redundant. It is just a place to calculate a new label for the new Table entry and call the MATCH function to calculate that entry.

[0133] Table 2-1 shows a set of matches between words and word groups similar to “the” and “rain” according to similarities listed in a Table such as that in Table 1-5.

TABLE 2-1
matched pair for the rain: the#1.000000 week#0.070000 match freq: 51 match score: 0.011114
matched pair for the rain: the#1.000000 decision#0.070000 match freq: 39 match score: 0.011076
matched pair for the rain: the#1.000000 diet#0.110000 match freq: 79 match score: 0.044070
matched pair for the rain: the#1.000000 plan#0.080000 match freq: 63 match score: 0.010392
matched pair for the rain: the#1.000000 book#0.110000 match freq: 30 match score: 0.017010
matched pair for the rain: the#1.000000 car#0.070000 match freq: 51 match score: 0.011005
matched pair for the rain: the#1.000000 world#0.070000 match freq: 193 match score: 0.030560
matched pair for the rain: the#1.000000 industry#0.080000 match freq: 54 match score: 0.010695
matched pair for the rain: the#1.000000 country#0.080000 match freq: 104 match score: 0.018888
matched pair for the rain: the#1.000000 year#0.050000 match freq: 111 match score: 0.006221
matched pair for the rain: the#1.000000 united&states#0.130000 match freq: 174 match score: 0.159827
matched pair for the rain: the#1.000000 soviet&union#0.100000 match freq: 121 match score: 0.102826
matched pair for the rain: the#1.000000 way#0.060000 match freq: 134 match score: 0.014404
matched pair for the rain: the#1.000000 public#0.060000 match freq: 99 match score: 0.010493
matched pair for the rain: the#1.000000 world#0.070000 match freq: 193 match score: 0.030560
matched pair for the rain: the#1.000000 us#0.070000 match freq: 447 match score: 0.023341
matched pair for the rain: the&japanese#0.090000 market#0.050000 match freq: 38 match score: 0.008989
matched pair for the rain: that&the#0.070000 group#0.060000 match freq: 8 match score: 0.002669
matched pair for the rain: japanese#0.060000 market#0.050000 match freq: 24 match score: 0.001048
matched pair for the rain: japanese#0.060000 company#0.060000 match freq: 22 match score: 0.000852
matched pair for the rain: japanese#0.060000 government#0.070000 match freq: 58 match score: 0.003152
matched pair for the rain: other#0.070000 party#0.050000 match freq: 17 match score: 0.002116
matched pair for the rain: other#0.070000 day#0.050000 match freq: 9 match score: 0.001135

[0134] For instance, Table 2-1 shows that the system has found a match between an equivalent to “rain” of “week”, with score 0.07 (i.e. mutual information between “rain” and “week” if they were listed in Table 1-5), and the trivial equivalence of “the” to itself (with mutual information 1.0). This match is then in its turn scored according to the mutual information of the matched pair “the week”. This mutual information, the same formula but measuring a different coincidence from that used to calculate word and word group similarities in the word similarity relating processes above, calculates the “strength” of the match, i.e. given “the”, how likely is it to be followed by “week”. In Table 2-1 “match freq” is the “frequency” or number of occurrences of the word pair in the corpus; in the case of “the week” this is 51 for the corpus on which this calculation was based. This is squared and divided by the overall number of occurrences of “the” and “week” in the corpus to calculate a mutual information value (“$mi” in the code example given in FIG. 3B). The method then calculates a score for the match based on the product of this mutual information, the “similarity” given in a Table like that of Table 1-5 of “the” with itself, the similarity given in a Table like that of Table 1-5 of “rain” with “week”, and a simple numeric factor to keep scores from getting too small (in the code example of FIGS. 3B-6 this is 10000). This scoring calculation can be seen in the code in FIGS. 3B-6:

$match_score=10000*$mi*$next_h_score*$next_p_score;
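The match score calculation can be restated as a short Python sketch. It mirrors the quoted line from FIGS. 3B-6 but is not that code. The pair frequency 51 and the similarities 1.0 and 0.07 come from Table 2-1; the separate corpus counts for “the” and “week” are not given in the text, so the values 100000 and 2000 below are hypothetical and the resulting score is not expected to equal the 0.011114 of the Table.

```python
def mutual_information(pair_freq, first_freq, second_freq):
    """Pair frequency squared, divided by the occurrences of each word separately."""
    return pair_freq ** 2 / (first_freq * second_freq)

def match_score(pair_freq, first_freq, second_freq, first_sim, second_sim, scale=10000):
    """Product of mutual information, the two similarity scores and a scaling factor."""
    mi = mutual_information(pair_freq, first_freq, second_freq)
    return scale * mi * first_sim * second_sim

score = match_score(
    pair_freq=51,        # "the week", from Table 2-1
    first_freq=100000,   # HYPOTHETICAL corpus count of "the"
    second_freq=2000,    # HYPOTHETICAL corpus count of "week"
    first_sim=1.0,       # similarity of "the" to "the"
    second_sim=0.07,     # similarity of "week" to "rain"
)
print(f"{score:.6f}")
```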

[0135] This score for the example match for “the rain” between “the” and “week” shown in Table 2-1 is given in the Table as 0.011114. Each of the matches between equivalents for “the” and “rain” has such a score. For the match between word equivalents “the plan”, “the” has word similarity score 1.0, “plan” has word similarity score 0.08, and the product to give $match_score above gives a match score 0.010392. For the match between word equivalents “other day” near the bottom of Table 2-1, “other” has word similarity score to “the” 0.07, and “day” has word similarity score to “rain” 0.05, and the match score is 0.001135. As in FIG. 4 these matches between equivalents of “the” and “rain” are accessed in the word and word group similarity Table to synthesize an entry for “the rain” in the Table if we did not have one previously. Elements from the Table entries of the matched pairs are added together and put into the new entry. This is shown in Table 2-2.

TABLE 2-2
New element for Table entry of (the rain) is: “bank” with score 0.000885
Added from entries for matched pairs: japanese&market+japanese&company+japanese&government+other&party
New element for Table entry of (the rain) is: “time” with score 0.025861
Added from entries for matched pairs: japanese&market+japanese&government+other&party+the&week+the&decision+the&diet+the&plan+the&book+the&car+the&world+the&industry+the&japanese&market+the&country+the&year+the&united&states+the&soviet&union+the&way
New element for Table entry of (the rain) is: “us” with score 0.007778
Added from entries for matched pairs: japanese&market+japanese&government+other&party+other&day+the&us+the&public+the&diet+the&world+the&japanese&market+the&group
New element for Table entry of (the rain) is: “year” with score 0.000343
Added from entries for matched pairs: japanese&market+japanese&company+japanese&government

[0136] Tables 2-3 and 2-4 show the same process for the pair “(in Spain)”.

TABLE 2-3
matched pair for in Spain: and#0.120000 Europe#0.140000 match freq: 14 match score: 0.004159
matched pair for in Spain: of#0.200000 japan#0.100000 match freq: 60 match score: 0.002157
matched pair for in Spain: of#0.200000 tokyo#0.100000 match freq: 33 match score: 0.002547
matched pair for in Spain: that#0.080000 japan#0.100000 match freq: 76 match score: 0.001720

[0137] TABLE 2-4
New element for Table entry of (in Spain) is: “he” with score 0.000103
Added from entries for matched pairs: that&japan
New element for Table entry of (in Spain) is: “and&Europe” with score 0.004159
Added from entries for matched pairs: and&Europe
New element for Table entry of (in Spain) is: “of&japan” with score 0.002157
Added from entries for matched pairs: of&japan
New element for Table entry of (in Spain) is: “of&tokyo” with score 0.002547
Added from entries for matched pairs: of&tokyo
New element for Table entry of (in Spain) is: “it” with score 0.000103
Added from entries for matched pairs: that&japan
New element for Table entry of (in Spain) is: “there” with score 0.000086
Added from entries for matched pairs: that&japan

[0138] The new word or word group similarity score, which in the original Table is calculated directly from occurrences of common contexts, is estimated for the element to be added in the new Table entry as the sum of products of the match score for each pair from which the new element is being added with the existing similarity score in the word and word group similarity Table between that pair and the element to be added. This calculation is shown in the code of FIGS. 3B-7 as: $new_score=$match_score*$score. This score is added to any existing estimation of a word or word group similarity score for this element in the new Table entry, shown in FIGS. 3B-7 as:

$$match_indicies {$index}=$$match_indicies {$index}+$new_score
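The accumulation of new-entry scores can be illustrated with the following Python sketch, which is a stand-in for, not a transcription of, the ADD_INDICIES code of FIGS. 3B-7. The match scores are those of Table 2-3. The similarity values inside the existing entries are not given in the text; the values below (1.00 for a pair with itself, 0.06 and 0.05 for the elements of “that&japan”) are assumptions chosen so that the sketch reproduces the scores of Table 2-4.

```python
from collections import defaultdict

def add_indices(matches, table):
    """Build a new Table entry.

    matches: {matched pair: match score}
    table:   {matched pair: {element: existing similarity score}}
    Each element's new score is the sum over contributing pairs of
    match score * existing similarity.
    """
    new_entry = defaultdict(float)
    for pair, match_score in matches.items():
        for element, score in table.get(pair, {}).items():
            new_entry[element] += match_score * score
    return dict(new_entry)

matches = {"and&Europe": 0.004159, "of&japan": 0.002157,
           "of&tokyo": 0.002547, "that&japan": 0.001720}
table = {
    "and&Europe": {"and&Europe": 1.00},                       # ASSUMED similarity
    "of&japan": {"of&japan": 1.00},                           # ASSUMED similarity
    "of&tokyo": {"of&tokyo": 1.00},                           # ASSUMED similarity
    "that&japan": {"he": 0.06, "it": 0.06, "there": 0.05},    # ASSUMED similarities
}

entry = add_indices(matches, table)
for element, score in entry.items():
    print(f"{element} {score:.6f}")
```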

[0139] For instance in Table 2-2 a new element for the new Table entryfor “the rain” is “bank” which is added from existing Table entries for“japanese market”, “japanese company”, “japanese government”, and “otherparty”, with sum of scores as the above products and sums equaling0.000885. Similarly the new element for the new Table entry for “therain” of “time” is added from existing Table entries for “japanesemarket”, “japanese government”, “other party”, “the week”, “thedecision”, “the diet”, “the plan”, “the book”, etc. That is all thematches made for “the rain” which have “time” in their own word and wordgroup similarity Table listings. The product of their existingsimilarity scores and the respective match scores for these contributingmatches sums to 0.025861. Finally we end up with a new word and wordgroup similarity Table entry for a potentially completely newcombination of words which can be used in further matches of the samekind as that between the Table entry for “the” and “rain” shown here.Such a further or “higher order” match is shown in Table 2-5. 
TABLE 2-5matched pair for (the rain) ( in Spain):  us#0.007778and&Europe#0.004159 match freq: 8 match score: 0.000219 matched pair for(the rain) (in Spain): bank#0.000885 of&japan#0.002157 match freq: 44match score: 0.000046 matched pair for (the rain) (inSpain):university#0.000717 of&tokyo#0.002547 match freq: 11 match score:0.000050 matched pair for (the rain) (inSpain): time#0.025861 it#0.000103 match freq: 25 match score: 0.000000matched pair for (the rain) (in Spain): time#0.025861 there#0.000086match freq: 11 match score: 0.000000 matched pair for (the rain) (inSpain): time#0.025861 he#0.000103 match freq: 25 match score: 0.000001matched pair for (the rain) (in Spain): year#0.000343 he#0.000103 matchfreq: 24 match score: 0.000000 matched pair for (the rain) (inSpain): year#0.000343 it#0.000103 match freq: 16 match score: 0.000000matched pair for (the rain) (in Spain): year#0.000343 there#0.000086match freq: 9 match score: 0.000000

[0140] This shows elements from a set of matches between elements in the word and word group similarity Table for “(the rain)” and “(in Spain)”. We can see the element “time” in the synthesized entry in the word and word group similarity Table for “(the rain)”, which was added by the previous level of word and word group similarity Table entry estimation shown above. Now it is taking part in a match to estimate new elements for the new entry in the word and word group similarity Table for the whole string “(the rain) (in Spain)”, associated not only with that string but with that order of combining elements of that string. The score which we estimated as a sum of products between the match scores for the matches which had entries containing “time”, and the similarities between those match elements and “time”, is being used as a word and word group similarity score itself, i.e. 0.025861. This is multiplied with the similarity value for an equivalent of “(in Spain)” for this particular match, e.g. “there” with score 0.000086, and this is multiplied by the product of the scaling factor of 10000 and the mutual information between “time” and “there” occurring in that order, to give a match score for “time there” as a parameter of the estimation of a new word and word group similarity Table entry for “(the rain) (in Spain)”. As at the previous level of iteration, elements are added to the new entry as a sum of scores for elements in contributing entries, where those scores are products of the existing similarity between the match pair and the element in its word and word group similarity Table entry, which is being added to the new entry, multiplied by the match score, as shown in FIGS. 3B-7 and already discussed above. At the end of this stage of processing, when all elements in a string passed from the application to the method as shown in FIG. 1b have been combined by matching existing and synthesized entries for them in the word and word group similarity Table calculated from the corpus, we have a new word and word group similarity Table entry for the whole string, with a given order of combination. The derivation of new elements for the string and order of pairing for that string “(the rain) (in Spain)” is given in Table 2-6:

TABLE 2-6
New element for Table entry of ((the rain) (in Spain)) is: “year” with score 0.000000
Added from entries for matched pairs: time&it#time&there#time&he#year&he#year&it#year&there
New element for Table entry of ((the rain) (in Spain)) is: “court” with score 0.000013
Added from entries for matched pairs: bank&of&japan#university&of&tokyo
New element for Table entry of ((the rain) (in Spain)) is: “us&and&Europe” with score 0.000219
Added from entries for matched pairs: us&and&Europe
New element for Table entry of ((the rain) (in Spain)) is: “yen” with score 0.000004
Added from entries for matched pairs: bank&of&japan
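The arithmetic of this step can be sketched as follows. This is a minimal illustration in Python, not the Perl code of the figures. The function names, the dictionary layout, the mutual-information value, and the toy similarity values are all assumptions made for the example; only the similarity values 0.025861 and 0.000086 and the scaling factor of 10000 come from the text above.

```python
from collections import defaultdict

SCALE = 10000  # scaling factor stated in the text


def match_score(sim_left, sim_right, mutual_info, scale=SCALE):
    """Product of the two equivalents' similarity values and
    scale * mutual information of the matched pair in that order."""
    return sim_left * sim_right * scale * mutual_info


def synthesize_entry(matches, table):
    """Build a new entry: for every element of every matched pair's
    Table entry, add (match score * similarity of pair to element).

    matches: {matched_pair: match_score}
    table:   {matched_pair: {element: similarity}}  (hypothetical layout)
    """
    entry = defaultdict(float)
    for pair, score in matches.items():
        for element, similarity in table.get(pair, {}).items():
            entry[element] += score * similarity
    return dict(entry)


# "time" (0.025861) matched with "there" (0.000086).
# The mutual-information value 0.001 is a placeholder, not a value from the text.
example = match_score(0.025861, 0.000086, mutual_info=0.001)

# Toy Table entries and match scores, made up for illustration only.
toy_table = {
    "bank&of&japan": {"court": 0.1, "yen": 0.2},
    "university&of&tokyo": {"court": 0.3},
}
toy_matches = {"bank&of&japan": 0.5, "university&of&tokyo": 0.25}
toy_entry = synthesize_entry(toy_matches, toy_table)
```

In the toy data, “court” receives contributions from two matched pairs and “yen” from one, mirroring the way Table 2-6 lists the matched pairs from which each new element was added.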

[0141] The system can repeat this process over all possible orders of combination of the elements in any given string passed by a natural language processing application and return synthesized Table entries for all of them. A diagrammatic representation of an alternate sequence of pair matches to create a new word and word group similarity Table entry for the presented string “the rain in Spain”, in this case associated with the order of matching “(((the rain) in) Spain)”, is shown in FIG. 5. Notice that while the basic processes of matching and synthesizing new word and word group similarity Table entries are the same, the actual elements which are added to the new entries differ according to the order of pairing. Such a diagram can be made to represent the sequence of matches for each order of pairing possible between a like number of words. These new Table entries can be thought of as representing the grammar of the string in that order of combination. The number and value of the elements of these Table entries can be thought of as giving a measure of the grammaticality or “goodness of fit” of the presenting string, in that order of combination, with the corpus of text representing the language for the purposes of this calculation. The values of the elements of each such synthesized Table entry can be added to give a sum, and the order of matches between pairs which produces a synthesized entry with the greatest sum of entry scores can be taken as the best “parse tree” of that string in the context of the language represented by the corpus. This summing of scores for final Table entries for a string is shown in the code in FIGS. 3B-2: foreach $entry (@parse_match_list) { $score = $score + 10 * ($entry - int $entry); }
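To make “all possible orders of combination” concrete, the sketch below enumerates every pairwise bracketing of adjacent elements and selects the one whose synthesized entry has the greatest sum. It is written in Python rather than the Perl of the figures. The names `bracketings`, `grammaticality`, `best_parse`, and the `entry_for` callback are hypothetical, and `entry_for` merely stands in for the matching procedure described above.

```python
def bracketings(words):
    """Yield every binary bracketing of adjacent elements as nested tuples."""
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield (left, right)


def grammaticality(entry):
    """Sum of the element scores of one synthesized entry."""
    return sum(entry.values())


def best_parse(words, entry_for):
    """Return (score, bracketing) for the highest-scoring order of pairing.

    entry_for(bracketing) -> {element: score} is a stand-in for the
    synthesis of a Table entry for that order of combination.
    """
    scored = [(grammaticality(entry_for(b)), b) for b in bracketings(words)]
    return max(scored, key=lambda pair: pair[0])


orders = list(bracketings(["the", "rain", "in", "Spain"]))
```

For the four-word example string this yields five orders of pairing, among them the order of Table 2-6 and the order shown in FIG. 5.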

[0142] Notice from the Tables of matches and derived equivalents, Tables 2-1 to 2-6, that derived equivalents for pairs need not also be pairs. This is because strings of arbitrary length, up to a maximum of 3 in this embodiment, are related freely to each other, according simply to whether they occur in the same previous and following word contexts or not. Thus “the rain” in Table 2-2 is related to “bank” via word pair matches, “japanese market”, “other party” etc., and the single word “time” via even three word matches such as “the soviet union”, and “the united states”. This means that in theory strings of arbitrary length can be represented by a word or word group similarity Table entry consisting of single words only. In Table 2-6 we see that the whole string “(the rain) (in Spain)” is represented by the single words “year”, “court”, and “yen”, as well as the three word string “U.S. and Europe”. This is possible if we have at least words and word pairs related together in the word and word group similarity Table because then matched pairs can be replaced at each level of matching by single words. Otherwise we need to keep information about combinations of words up to the full length of the string to be analyzed and meet a severe data sparseness problem.

[0143] FIG. 6 shows a diagrammatic representation of the closest art for the estimation of a vector grammatical representation, where vector is in the sense of a vector grammatical representation in FIG. 2b, here usually referred to as “an entry in the word or word group similarity Table”. In FIG. 6 matches are made between entries in a word similarity Table for unobserved pairs and those matches are used to estimate probabilities for those unobserved pairs. The sets of pairs found in the matches are not used iteratively in further matches to estimate grammatical properties for longer strings and give an order of pairing which can be associated with a parse tree, and the sum of similarity scores for the match pairs is not associated with a measure of grammaticality of the string. Also word pairs are not related to single words so that arbitrarily long strings can be represented as word and word group similarity Table entries consisting only of single words. It is also a vitally important element of the method shown in FIGS. 4 and 5 that not only words, but word pairs, or longer strings of some length, exist in the initial word and word group similarity Table and the method uses them in word pair estimations at some stage, such as the estimation of new word pair entries in FIGS. 5 and 6, because this means that the different orders of pairing will result in different final synthesized entries. This is necessary for the method to be used to construct a parse tree. Take for instance the example string “the rain in Spain”. If matches between elements similar to “the” and to “rain”, e.g. “the week”, “the party”, “other party” (Table 2-1), were used simplistically to match with similarly directly estimated equivalents for “in Spain”, such as “and Europe”, “of japan” (Table 2-3), and thereby estimate a vector (in the sense of FIG. 3b) to represent “the rain in Spain”, then the final “vector” would have elements such as “the party and Europe”, and “other party of japan”. But these would also be elements of a vector representing any other order of pairwise combination, because all elements in all matched pairs would be in the initial word and word group similarity Table entries for the elements of the initial string. There could be no possible difference for the different orders of combination, and thus no possible means to differentiate a parse tree for the string. When elements from Table entries for the matched pairs rather than the matched pairs themselves are substituted, however, some matched pairs, because they are more “natural” grammatical units, will have larger word and word group similarity Table entries, and the pairing orders associated with these will have larger final word and word group similarity Table entries. They will thus indicate the more natural pairing order, or parse tree. For instance elements for the synthesized entries of “the rain” in this case, while they will also trivially include “the week”, “the party”, “other party”, will also include “bank”, “time”, “U.S.”, and “year” as shown in Table 2-2, and it is these which provide the matches in Table 2-5, and thus the additional elements for the new entry for “(the rain) (in Spain)” shown in Table 2-6: i.e. “year”, “court”, “U.S. and Europe”, and “yen”. It is this which distinguishes it from other possible orders of pairwise combination such as “(((the rain) in) Spain)” shown in FIG. 5 according to the underlying principle of grammaticality where grammaticality is scored as the number and quality (similarity to the posited string) of elements in its synthesized word and word group similarity Table entry.

[0144] There are four important points to note about the method:

1) Scoring: a principle of grammaticality based on the number and quality (similarity to the posited string) of elements in a synthesized (recombined) entry for the posited string in the word and word group similarity Table.

2) Recombination is the process of adding matches together to make this new grammatical representation. It also means they can be used e.g. as elements in another match, rather than just as separate elements to be polled individually as estimators of some grammatical value, such as probability. This is also essential for both finding arbitrary levels of structure and estimating entries for longer strings using shorter strings, because both require repetition of the matching process.

3) It is essential to have strings longer than length one in the word and word group similarity Table, and to examine different orders of pairing, for you to be able to calculate structure, or a parse tree (so you get different equivalent word lists for elements for different orders of parsing).

4) It is essential to have longer strings related to shorter strings in the word and word group similarity Table, at least strings of length two related to strings of length one, if you do not want to have to list Table entries for arbitrarily long strings, in every order of combination, directly from the data beforehand.

[0145] Note that there are three major senses of “score” used by the method. One is the word or word group similarity score estimated directly from the text of the corpus as the mutual information of words and word groups in context or other similar statistic. This gives a measure of the similarity of two words. The second is the match score, which is a product of the word or word group similarity scores for the equivalent words of a match and the mutual information of the words in the match occurring in sequence; this is a mix of the strength of the match and the similarity of the words taking part in the match to the words in the unobserved pair. This is then multiplied by the word or word group similarity score of an element being added from an entry for a match pair to the entry being synthesized for some new pair to estimate a similarity value between the element and the new pair (in FIGS. 3B-6). The third major sense of score, and the most important from the grammatical point of view, is the sum of word and word group similarity values for a synthesized entry calculated for new word and word group similarity Table entries associated with each possible order of pairing of the elements of a presented string. This gives a measure of grammaticality or “goodness of fit” and can be used to select an order of pairing which is associated with a best parse tree for a string with respect to a given corpus (code for this sum in FIGS. 3B-2).
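The three senses of score can be written compactly as below. The notation is an assumption introduced here for illustration and does not appear in the figures: sim(u, e) is the Table similarity between a string u and an element e, I(a, b) is the mutual information of a followed by b, K is the scaling factor (10000 in the example above), M(x, y) is the set of matched pairs found for the new pair x y, and ab is the matched pair taken as one string.

```latex
\begin{align}
  % second sense: match score for a matched pair (a, b) standing in for (x, y)
  m(a,b) &= \operatorname{sim}(x,a)\,\operatorname{sim}(y,b)\,K\,I(a,b) \\
  % estimated similarity of element e to the new pair, summed over matches
  s(e \mid x\,y) &= \sum_{(a,b)\in M(x,y)} m(a,b)\,\operatorname{sim}(ab,e) \\
  % third sense: grammaticality of the string in this order of pairing
  G(x\,y) &= \sum_{e} s(e \mid x\,y)
\end{align}
```

The first sense is sim itself, read directly from the corpus; the order of pairing with the largest G is the one taken as the best parse tree.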

[0146] As a general note on Serial vs. Parallel implementation: In the preferred embodiment here described the classes are represented as sets of similar words in a Table (where similar is defined as occurring in a common context e.g. the same immediately preceding and following word, though other contexts and other features could be used), but the relationships embodied in these Tables are more naturally expressed by networks of word associations. In fact a basic system working along the principles here described might be summarized (for a trivial case) by the network (where links show actual combinations found in a set of data):

[0147] My word and word sequence similarity Tables might be seen to be one embodiment of the observation: “A” and “The” both combine with “dog” so the syntax of “A” and “The” are linked (they are “similar” tokens). The more links there are between two words, the more likely combinations of one will also apply to the other, the more “similar” they are. And the “matching” operation which I use to force recombinations of new paradigmatic vector classes might be expressed as the analogy: “A dog”, “The dog” and “The cat” contribute evidence for a new combination: “A cat” (because “A” and “The” are similar, and “The” combines with “cat”). Algebraically:

[0148] Table: AB, CB, CD=>A|C, D|B . . .

[0149] Match: AB, CB, CD=>AD

[0150] Thus both the Tables and the match processor in the current embodiment can be seen to be equivalent to a single network of word association relations, and the match function just extends the relation one (or more) remove of distance.
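The Table and Match relations above can be illustrated on the trivial “A dog”, “The dog”, “The cat” case. The following is a minimal sketch in Python, not the code of the figures. The function names are hypothetical, and similarity is reduced to the simple sharing of at least one neighbor, with no scores attached.

```python
from collections import defaultdict


def shared_context(neighbours):
    """Link tokens whose neighbour sets overlap."""
    similar = defaultdict(set)
    for t in neighbours:
        for u in neighbours:
            if t != u and neighbours[t] & neighbours[u]:
                similar[t].add(u)
    return similar


def build_tables(pairs):
    """Table step: AB, CB, CD => A|C (shared follower), D|B (shared preceder)."""
    followers = defaultdict(set)   # left token  -> tokens observed after it
    preceders = defaultdict(set)   # right token -> tokens observed before it
    for left, right in pairs:
        followers[left].add(right)
        preceders[right].add(left)
    return shared_context(followers), shared_context(preceders)


def match(pairs):
    """Match step: AB, CB, CD => AD, by swapping in a similar left token."""
    observed = set(pairs)
    left_sim, _ = build_tables(pairs)
    proposed = set()
    for left, right in observed:
        for other in left_sim.get(left, ()):
            if (other, right) not in observed:
                proposed.add((other, right))
    return proposed


observed_pairs = [("A", "dog"), ("The", "dog"), ("The", "cat")]
left_sim, right_sim = build_tables(observed_pairs)
new_pairs = match(observed_pairs)
```

Here `left_sim` links “A” with “The”, `right_sim` links “dog” with “cat”, and `new_pairs` contains the single unobserved combination (“A”, “cat”).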

[0151] With a serial computer it is more efficient to summarize the word (and easily extended to word group) similarity information of this hypothetical network in an index and perform lookups on this index to find matches, but this does not alter the underlying network nature of the method.

[0152] Although applicant has described applicant's preferred embodiments of this invention, it will be understood that the broadest scope of this invention includes such modifications as diverse computer processing and apparatus and media. Such scope is limited only by the below claims as read in connection with the above specification.

[0153] Further, many other advantages of applicant's invention will be apparent to those skilled in the art from the above descriptions and the below claims.

What is claimed is:
1) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination: a) for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element; b) from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus; and c) for such matched pairs of such matched-pairs third listing, finding, from such string data from such corpus, a fourth listing of each fourth such natural-language element syntactically equivalent to any such matched pair of said third listing.
2) The computer system according to claim 1 further comprising: a) scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus.
3) The computer system according to claim 1 further comprising: a) for such fourth natural-language elements of such fourth listing, finding, from such string data from such corpus, a fifth listing of each such natural-language element syntactically equivalent to any such fourth natural-language element.
4) The computer system according to claim 3 further comprising: a) scoring each such natural-language element of such fifth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fifth listing in such string data from such corpus.
5) The computer system according to claim 3 further comprising: a) for such nth natural-language elements of such nth listing, finding, from such string data from such corpus, an (n+1)th listing of each such natural-language element syntactically equivalent to any such nth natural-language element.
6) The computer system according to claim 5 further comprising: a) scoring each such natural-language element of such (n+1)th listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th listing in such string data from such corpus.
7) The computer system according to claim 1 further comprising: a) for a second adjoining pair, comprising such first adjoining pair as a second first pair element and another natural-language element adjoining such first adjoining pair as a second second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a second first listing of each such element syntactically equivalent to such second first pair element and a second second listing of each such element syntactically equivalent to such second second pair element; b) from matching each such second first-listing element with each such second second-listing element, making a matched-pairs second third listing by finding which matched pairs of said matching are found in such string data from such corpus; and c) for such matched pairs of such matched-pairs second third listing, finding, from such string data from such corpus, a second fourth listing of each second fourth such natural-language element syntactically equivalent to any such matched pair of such second third listing.
8) The computer system according to claim 7 further comprising: a) scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus.
9) The computer system according to claim 7 further comprising: a) for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element; b) from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and c) for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing.
10) The computer system according to claim 9 further comprising: a) scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus.
11) The computer system according to claim 1 further comprising: a) repeating such steps of claim 1 while considering i) such original first adjoining pair as a new first pair element in such repeating, ii) such original fourth listing as a new first listing in such repeating, and iii) a new natural-language element adjoining, in such input string, such new first pair element as a new second pair element, thereby providing a new first adjoining pair, iv) thereby providing a new fourth listing in association with such new first adjoining pair.
12) The computer system according to claim 11 further comprising: a) re-performing steps a)i) through a)iv) of claim 11 while considering i) such new first adjoining pair as a first replacement first pair element in such re-performing, ii) such new fourth listing as a first replacement first listing in such re-performing, and iii) a further new natural-language element adjoining, in such input string, such first replacement first pair element as a first replacement second pair element, thereby providing a first replacement first adjoining pair, iv) thereby providing a first replacement fourth listing in association with such first replacement first adjoining pair.
13) The computer system according to claim 12 further comprising: a) further continuing to perform, for such entire input string, steps a)i) through a)iv) of claim 12 while considering i) such nth first adjoining pair as an (n+1)th replacement first pair element in such further performing, ii) such nth fourth listing as an (n+1)th replacement first listing in such further performing, and iii) a further new natural-language element adjoining, in such input string, such (n+1)th replacement first pair element as an (n+1)th replacement second pair element, thereby providing an (n+1)th replacement first adjoining pair, iv) thereby providing an (n+1)th replacement fourth listing in association with such (n+1)th replacement first adjoining pair.
14) The computer system according to claim 13 further comprising: a) for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element; b) from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and c) for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing.
15) The computer system according to claim 14 further comprising: a) scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus; b) wherein said scoring comprises a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus; and c) wherein such scores for each such natural language element of such (n+1)th fourth listing are essentially added to determine a scoring for a string comprising such (n+1)th replacement first adjoining pair.
16) The computer system according to claim 15 wherein such computer system is applied to possible ordered string subcombinations of at least two potential parses of such natural-language elements of such input string and a highest such scoring among such potential parses is used to determine maximum grammaticality among such potential parses.
17) The computer system according to each of claims 2, 4, 6, 8, 10, and 15 wherein said scoring comprises: a) a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus.
18) The computer system according to each of claims 2, 4, 6, 8, 10, and 15 wherein such scoring of each such fourth list element comprises: a) the product of i) a measure of statistical similarity between each such element (of such first listing) syntactically equivalent to such first pair element and such first pair element; ii) a measure of statistical similarity between each such element (of such second listing) syntactically equivalent to such second pair element and such second pair element; iii) a measure of statistical association between such first and second pair elements; and iv) a measure of statistical similarity between each matched pair of such matched-pairs third listing and each fourth such natural-language element of such fourth listing; and b) the sum of each such product for each such third list element.
19) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string, to be parsed, of natural-language elements in the subject language, for assisting natural-language parsing, comprising, in combination: a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; b) from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and c) from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse.
20) The computer system according to claim 19 wherein such scoring comprises essentially adding scores for each such entry to obtain a score for such potential parse.
21) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination: a) for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element; and b) from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus; c) wherein at least one of said first adjoining pair comprises at least a pair of natural-language elements.
22) The computer system according to claim 21 wherein: a) at least one of such first pair element and such second pair element comprises at least a pair of words.
23) The computer system according to each of claims 1-21 wherein each such pair element comprises at least one word.
24) The computer system according to each of claims 1-21 wherein each such pair element comprises at least two words.
25) A computer-readable medium (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) whose contents cause a computer system to determine a grammatical parse by: a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; b) from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and c) from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse.
26) A computer-implemented natural-language system (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) comprising: a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, means for building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination; b) means for building, from such equivalence lists, in different orders for each potential parse of said input string, to a final equivalence list for each such potential parse of such input string; and c) means for scoring, from the number and quality of entries in each respective such final equivalence list, the grammaticality of such respective potential parse.